4 Jensen's Inequality, Entropy

1 Convex Function

Convex Function

A function $g:(a,b)\to\mathbb{R}$ is convex if
$$g(\lambda x_1+(1-\lambda)x_2)\le \lambda g(x_1)+(1-\lambda)g(x_2),\tag{2.1}$$
for all $x_1,x_2\in(a,b)$ and $\lambda\in[0,1]$.
$g$ is called strictly convex if equality holds only for $\lambda=0,1$, or $x_1=x_2$. I.e., the graph of $g$ over $(a,b)$ contains no straight line segments.
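As a quick numeric sanity check of (2.1), the sketch below evaluates the chord gap $\lambda g(x_1)+(1-\lambda)g(x_2)-g(\lambda x_1+(1-\lambda)x_2)$ for the strictly convex $g(x)=x^2$ over a grid; the function `chord_gap` and all grid values are illustrative choices, not from the notes.

```python
import numpy as np

def chord_gap(g, x1, x2, lam):
    """lam*g(x1) + (1-lam)*g(x2) - g(lam*x1 + (1-lam)*x2); >= 0 iff the chord lies above the graph."""
    return lam * g(x1) + (1 - lam) * g(x2) - g(lam * x1 + (1 - lam) * x2)

g = lambda x: x ** 2  # strictly convex
gaps = [chord_gap(g, x1, x2, lam)
        for x1 in np.linspace(-2, 2, 9)
        for x2 in np.linspace(-2, 2, 9)
        for lam in np.linspace(0, 1, 11)]
print(min(gaps) >= -1e-12)  # True: (2.1) holds at every grid point
```

For $g(x)=x^2$ the gap is exactly $\lambda(1-\lambda)(x_1-x_2)^2\ge 0$, which is why only floating-point tolerance is needed.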

![[Pasted image 20241128122332.png|500]]

Now consider three points $P_1,P_2,P_3\in\mathbb{R}^2$. The convex hull of $\{P_1,P_2,P_3\}$ is
$$\{(x,y)\in\mathbb{R}^2 \mid (x,y)=\lambda_1P_1+\lambda_2P_2+\lambda_3P_3,\ \lambda_1+\lambda_2+\lambda_3=1,\ \lambda_i\ge 0\},$$
the triangular region spanned by the three points. If $g$ is convex, by the figure
$$g(\lambda_1x_1+\lambda_2x_2+\lambda_3x_3)\le \lambda_1g(x_1)+\lambda_2g(x_2)+\lambda_3g(x_3).$$
![[Pasted image 20241128123143.png|400]]
Iterating this argument, we can conclude the analogous inequality for any finite number of points:

2 Jensen's Inequality

Theorem (Jensen's Inequality)

A convex function $g:(a,b)\to\mathbb{R}$ satisfies
$$g\left(\sum_{i=1}^n \lambda_i x_i\right)\le \sum_{i=1}^n \lambda_i g(x_i),\tag{2.2}$$
for all $\lambda_1,\dots,\lambda_n$ satisfying $\sum_{i=1}^n\lambda_i=1$ and $\lambda_i\ge 0\ \forall i$.

Corollary

Let $X$ be an $\mathbb{R}$-valued discrete RV and $g$ a convex function. Then
$$g(E[X])\le E[g(X)].\tag{2.3}$$
If $g$ is strictly convex and $X$ is not constant, then $g(E[X])<E[g(X)]$.
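A minimal numeric illustration of the corollary, using the strictly convex $g(x)=e^x$ and a small discrete RV whose values and probabilities are arbitrary choices:

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])   # values of X
p = np.array([0.2, 0.5, 0.3])    # P(X = x_i)
g = np.exp                       # strictly convex

lhs = g(np.dot(p, x))            # g(E[X])
rhs = np.dot(p, g(x))            # E[g(X)]
print(lhs < rhs)                 # True: strict, since X is not constant
```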

3 Entropy

Suppose $P,Q$ are two probability measures on $(\Omega,\mathcal{F})$, and $X$ a discrete RV, s.t.
$$P(X=x)=p(x),\quad Q(X=x)=q(x),\quad x\in\mathrm{Range}(X).$$

Shannon Entropy

$$H(P)=-\sum_x p(x)\log p(x)=E_p[-\log p(X)].$$
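A small sketch of the definition (natural log here; use `np.log2` for bits), with the usual convention $0\log 0=0$ handled by masking zero-probability outcomes:

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p(x) log p(x), with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.5, 0.5]))   # ln 2 ≈ 0.6931: maximal for two outcomes
print(entropy([1.0, 0.0]))   # 0.0: a deterministic RV carries no surprise
```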

Cross Entropy

The cross entropy of $Q$ relative to $P$ is
$$H(P,Q)=-\sum_x p(x)\log q(x)=E_p[-\log q(X)].$$

KL Divergence

The KL divergence (Kullback–Leibler divergence) of $P$ from $Q$ is
$$\mathrm{KL}(P\|Q)=H(P,Q)-H(P)=-\sum_x p(x)\log\frac{q(x)}{p(x)}=E_p\left[-\log\frac{q(X)}{p(X)}\right].$$
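The identity $\mathrm{KL}(P\|Q)=H(P,Q)-H(P)$ can be checked numerically; the two distributions below are arbitrary illustrative choices over three outcomes:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))       # Shannon entropy H(P)
H_pq = -np.sum(p * np.log(q))       # cross entropy H(P, Q)
kl   =  np.sum(p * np.log(p / q))   # KL(P||Q)

print(np.isclose(kl, H_pq - H_p))   # True: KL = cross entropy minus entropy
print(kl >= 0)                      # True: non-negativity, proved below via Jensen
```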

For a continuous RV, replace $p(x)$ with the p.d.f. and the sum with an integral, i.e. $\mathrm{KL}(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx$.
In general, $\mathrm{KL}(p\|q)\ne \mathrm{KL}(q\|p)$.
We can use Jensen's inequality to prove that the KL divergence is non-negative. See below.

To remedy this asymmetry, we define the JS divergence:

JS Divergence

Denote $M=\frac{1}{2}(P+Q)$; then the JS divergence is defined as
$$\mathrm{JS}(P\|Q)=\frac{1}{2}\mathrm{KL}(P\|M)+\frac{1}{2}\mathrm{KL}(Q\|M).$$
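A sketch contrasting the two divergences on the same pair of (arbitrarily chosen) distributions: KL is asymmetric, while JS is symmetric by construction.

```python
import numpy as np

def kl(p, q):
    """KL(p||q) for strictly positive discrete distributions."""
    return np.sum(p * np.log(p / q))

def js(p, q):
    """JS(p||q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m the midpoint mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print(np.isclose(kl(p, q), kl(q, p)))   # False: KL is asymmetric
print(np.isclose(js(p, q), js(q, p)))   # True: JS is symmetric
```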


See more discussion of entropy in Maximum Entropy Model, and of information gain in Decision Tree.

4 ELBO

$X$: observed data. $Z$: latent variable. $q$: a distribution over $Z$. Then

$$\mathrm{KL}(q\|p)=E_q\left[-\log\frac{p(Z\mid X,\theta)}{q(Z)}\right],$$

then
$$\underbrace{\log p(X\mid\theta)}_{\text{log-likelihood}}=\underbrace{\mathcal{L}(q,X,\theta)}_{\text{ELBO}}+\underbrace{\mathrm{KL}(q\|p)}_{\text{KL divergence}},$$
where
$$\mathcal{L}(q,X,\theta)=E_q\left[\log\frac{p(X,Z\mid\theta)}{q(Z)}\right]$$
is called the Evidence Lower Bound (ELBO).
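The decomposition $\log p(X\mid\theta)=\mathcal{L}(q,X,\theta)+\mathrm{KL}(q\|p)$ can be verified on a toy two-state latent model; the prior, likelihood, and variational distribution below are all arbitrary assumed numbers.

```python
import numpy as np

p_z         = np.array([0.6, 0.4])   # prior p(Z)
p_x_given_z = np.array([0.1, 0.7])   # likelihood p(X = x_obs | Z), one fixed observation
q           = np.array([0.3, 0.7])   # arbitrary variational distribution over Z

p_xz = p_z * p_x_given_z             # joint p(X = x_obs, Z)
log_evidence = np.log(p_xz.sum())    # log p(X = x_obs)
p_post = p_xz / p_xz.sum()           # posterior p(Z | X = x_obs)

elbo = np.sum(q * np.log(p_xz / q))        # E_q[log p(X,Z)/q(Z)]
kl   = np.sum(q * np.log(q / p_post))      # KL(q || p(.|X))

print(np.isclose(log_evidence, elbo + kl)) # True: the decomposition holds
print(elbo <= log_evidence)                # True: ELBO is a lower bound
```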

Claim

$\mathrm{KL}(p\|q)\ge 0$, and "$=$" holds iff $p(x)=q(x)\ \forall x$.
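This claim is exactly where Jensen's inequality enters. A short derivation, assuming $p$ and $q$ have the same support:

```latex
\mathrm{KL}(p\|q)
  = E_p\!\left[-\log\frac{q(X)}{p(X)}\right]
  \;\ge\; -\log E_p\!\left[\frac{q(X)}{p(X)}\right]
  = -\log \sum_x p(x)\,\frac{q(x)}{p(x)}
  = -\log \sum_x q(x)
  = -\log 1
  = 0,
```

where the inequality is (2.3) applied to the convex function $-\log$. Since $-\log$ is strictly convex, equality forces $q(X)/p(X)$ to be constant, and both distributions summing to 1 then gives $p=q$.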

Key facts for ELBO:

  1. $\mathcal{L}(q,X,\theta)$ is a lower bound on the log-likelihood $\log p(X\mid\theta)$, since by the above $\mathrm{KL}(q\|p)\ge 0$ for all $p,q$.
  2. The ELBO is typically much easier to compute or optimize than $\log p(X\mid\theta)$.